Ultra high throughput sequencing excludes MDH1 as candidate gene for RP28-linked retinitis pigmentosa.

PURPOSE
Mutations in IDH3B, an enzyme participating in the Krebs cycle, have recently been found to cause autosomal recessive retinitis pigmentosa (arRP). The MDH1 gene maps within the RP28 arRP linkage interval and encodes cytoplasmic malate dehydrogenase, an enzyme functionally related to IDH3B. As a proof of concept for candidate gene screening to be routinely performed by ultra high throughput sequencing (UHTs), we analyzed MDH1 in a patient from each of the two families described so far to show linkage between arRP and RP28.


METHODS
With genomic long-range PCR, we amplified all introns and exons of the MDH1 gene (23.4 kb). PCR products were then sequenced by short-read UHTs with no further processing. Computer-based mapping of the reads and mutation detection were performed by three independent software packages.


RESULTS
Despite the intrinsic complexity of human genome sequences, reads were easily mapped and analyzed, and all algorithms used provided the same results. The two patients were homozygous for all DNA variants identified in the region, which confirms previous linkage and homozygosity mapping results, but had different haplotypes, indicating genetic or allelic heterogeneity. None of the DNA changes detected could be associated with the disease.


CONCLUSIONS
The MDH1 gene is not the cause of RP28-linked arRP. Our experimental strategy shows that long-range genomic PCR followed by UHTs provides an excellent system to perform a thorough screening of candidate genes for hereditary retinal degeneration.

EC 1.1.1.41), was recently found to be associated with arRP, indicating a link between the Krebs cycle and retinal disease [5]. One of the genes within the RP28 linkage interval, MDH1, encodes for the cytosolic form of malate dehydrogenase (EC 1.1.1.37), which is directly connected to the Krebs cycle via the malate-aspartate shuttle. The gene products of IDH3B and MDH1 are related at an additional functional level, since malate dehydrogenase can convert the product of isocitrate dehydrogenase [6] (Figure 1). Therefore, we reasoned that MDH1 could correspond to the RP28 locus, and mutations in its sequence could be responsible for the disease in a manner similar to that of pathogenic changes in IDH3B.
A common approach to the mutational screening of candidate genes consists of sequencing their exons and immediate intron boundaries. However, since pathogenic mutations can sometimes be located deep within introns, as was recently shown for retinal degeneration genes as well [7,8], we decided to analyze the full MDH1 sequence. To circumvent problems linked to sequence length and composition and to test the potential of parallel sequencing in routine mutation detection, we used long-range PCR (LR-PCR) amplification of genomic DNA followed by ultra high throughput sequencing (UHTs).

DNA samples and long-range PCR:
The genomic DNA used in this study was part of the collection of samples that originally allowed mapping of the RP28 locus. Samples were obtained in accordance with the ethical guidelines regulating these previous investigations [3,4]. Specifically, they belonged to patient V-4 from family PMK146 [3] and patient IV-7 from family IIS-2 [4]. The entire MDH1 gene (nine exons and eight introns) as well as an additional 0.5 kb upstream and 4.7 kb downstream (23,459 bp in total, from nucleotide 63,669,094 to nucleotide 63,692,552 of chromosome 2, NCBI Homo sapiens Build 36.3, NC_000002.10) were amplified by three tiled LR-PCRs, using the primer pairs described in Table  1. PCR reaction mixes were set as specified by the standard protocol of TaKaRa LA Taq (TaKaRa Bio Inc., Shiga, Japan) using GC Buffer I, except the final reaction volume was reduced to 10 µl, and the concentration of each primer to 0.1 µM. The thermal profile common to all sets of primers consisted of an initial step at 94 °C for 1 min, followed by 30 cycles at 98 °C for 5 s and 68 °C for 15 min, and a final step at 72 °C for 10 min. Prior to sequencing, LR-PCR products were checked on agarose gel, quantified using the ImageJ software [9], and pooled in roughly equimolar quantities.
Ultra high throughput (UHT) and conventional Sanger sequencing: UHTs was performed with an Illumina Genome Analyzer (Illumina, San Diego, CA) for the DNA of patient IV-7/IIS-2 and an Illumina Genome Analyzer II (Illumina) with the paired-end procedure for the DNA of patient V-4/ PMK146, according to the manufacturer's protocols and starting from around 400 ng of LR-PCR products.
Conventional Sanger sequencing was performed using exon-specific primers ( Table 2) and the BigDye Terminator v1.1 cycle sequencing kit (Applied Biosystems, Foster City, CA) on LR-PCR products purified by treatment with ExoSAP-IT (USB, Cleveland, OH). Sequencing reactions were purified on Performa DTR columns (Edge BioSystems, Gaithersburg, MD) and run on an ABI-3130XL (Applied Biosystems).
Sequence analysis: Mapping and analysis of UHTs reads were performed independently with the use of three different software packages: FetchGWI/align0 [10,11], Maq [12], and the CLC Genomics Workbench (CLC bio, Aarhus, Denmark). In all instances, analyses were performed on the original calls generated by the Genome Analyzer. The procedure followed when using the FetchGWI/align0 packages was composed of multiple sequential steps. First, all the reads were filtered according to the Phred quality score of each nucleotide, replacing nucleotides with low scores by "Ns" and discarding reads with more than three Ns. Second, fetchGWI was used to find unique exact matches of all the retained reads. Reads that mapped within the target regions were directly used, while those mapping outside of these regions were kept as a negative control group. Third, among reads for which it was impossible to find any matches (either unique or repeated), those sharing a common 12-mer (except mono-or dinucleotide repeats) with the selected regions were kept and aligned against them by a global Smith-Waterman procedure (by align0). Fourth, reads that had a good score (no more than three mismatches) were verified not to produce a better alignment score against the negative control group kept during the first step. Finally, DNA variants were derived from sequences retained at step #1 and from the global alignment outputs produced at step #3. Mapping of sequences and mutation detection via Maq were accomplished as follows. First, Maq performed ungapped alignments of the reads on the reference sequence by retaining only reads with one or no mismatches in the first 24 bases and a maximum of one mismatch for the remaining bases. After mapping, Maq generated a consensus sequence and calculated the Phred quality score at each position of this sequence. Results from the analysis allowed the calling of genomic variations compared to reference sequence. The requirements for calling a variation were a minimum Phred quality score of 40; no insertions or deletions in the range of five bases; only an additional single nucleotide polymorphism (SNP) within a ten-base window; a minimum neighbor Phred quality score of 20; and a minimum coverage of 3×. The CLC Genomics Workbench-assisted mapping and mutation detection were conducted according to the general procedure recommended by the software house. More specifically, the raw sequences first underwent a filtering and trimming process based on the quality of their scores (quality score cutoff=0.01, maximum number of ambiguous nucleotides=2). All reads satisfying these criteria were then mapped on the 23,459-bp reference sequence mentioned above using the Reference Assembly algorithm within the UHTs module of the CLC package (mismatch cost=2, indel costs=3). Detection of variants was performed by identifying on the final mapping all bases for which 60% or more of the calls differed from the reference sequence, regardless of the base coverage.
Electropherograms generated by Sanger sequencing were analyzed either by the Staden software package [13] or the CLC Genomics Workbench (CLC bio).

RESULTS
We sequenced 23,459 bp of human chromosome 2, which span the MDH1 gene, in patients V-4/PMK146 and IV-7/IIS-2 by UHTs. All three software packages used to filter and align the reads provided essentially the same results and identified the same DNA variants, all of which were intronic ( Table 3). As expected, affected members from the two consanguineous families displayed homozygosity for all DNA changes present in the region. This supports the hypothesis that two copies of the same recessive mutation were inherited by patients from a single ancestor who was common to both their paternal and maternal branches, as previous microsatellite analyses have suggested [3,4]. Since all of the identified variants represented either known SNPs or changes with no predicted effects on splicing [14], none of them was considered potentially pathogenic. Sanger sequencing of all MDH1 exons and their intron vicinities confirmed the absence of DNA variants with respect to the reference sequence, apart from seven SNPs already detected by UHTs: rs10469944, rs2305157, rs7606045, rs2604613, rs262472, rs262473 (in V-4/ PMK146) and g.42648177T>A (in IV-7/IIS-2).
The large majority of UHTs reads were of good quality, as only a minimal part of them were discarded by the filtering procedures. Furthermore, filtered sequences could be easily mapped back to the reference sequence of the human genome, with an overall success rate of about 97%. The average length of the reads (30.5 bases) was also satisfying and, given the amount of sequences that were generated, produced a very high mean base coverage. More specifically, sequencing runs produced 5.2 million and 4.8 million raw sequences when the LR-PCRs from patients V-4/PMK146 and IV-7/IIS-2, respectively, were used as a template. Following the Maq procedure, 5.1 million and 4.5 million sequences from these two samples were retained after filtering for quality, generating an average coverage of 7,600× and 6,700× per given base. Slightly lower values were obtained with the CLC bio and FetchGWI/align0 packages. For CLC bio, 4.9 million (V-4/PMK146) and 3.8 million (IV-7/IIS-2) reads from the raw sequences satisfied filtering criteria, producing an average coverage of 7,000× and 4,700×, respectively. For the FetchGWI/align0 procedure, the corresponding values were 5.0 million and 3.1 million sequences, resulting in a mean coverage of 6,500× and 3,700×.
To determine the limits of UHTs as a potential tool for routine mutation detection in a homozygous context, we simulated various coverage values by randomly selecting 500,000, 50,000, 25,000, 10,000, and 5,000 filtered reads from the V-4/PMK146 sample. We then remapped them onto the 23,459-bp reference contig with the CLC bio software, exactly as performed on the original set of sequences. Although the average coverage was substantially different across the sets, as well as proportional to the simulated coverage (Table 4), we did not observe any dramatic reduction in power for the correct detection of SNPs. Specifically, whenever coverage of at least 2× could be preserved, all SNPs were correctly called (Table 4). Below this threshold, false positives began to appear, as expected, since technical sequencing errors could not be verified by any overlapping sequence (data not shown).

DISCUSSION
Identification of RP genes can be performed via genome-wide linkage analyses (when large families with multiple affected individuals are available), candidate gene screenings (when several dozen unrelated patients can be investigated), or a combination of both techniques. Good candidate genes share sequence similarities with other disease genes, relate to the same biochemical pathway, or are expressed mainly or exclusively in the affected tissue or organ [15,16]. To increase the chances of gene identification, the systematic screening of all genes present within an interval defined by linkage mapping has also been adopted, allowing the discovery of mutations in genes that, a priori, were not considered prime candidates [17,18].
Regardless of the strategy used to select the candidate genes, in most cases only exonic and nearby intronic sequences are investigated, for a few practical reasons. Specifically, exons and surrounding splicing signals are more likely to harbor disease-causing mutations than deep intronic regions. In addition, they are shorter and cheaper to analyze as well as easier to sequence, since they rarely contain repeats or low-complexity regions. However, by definition, this strategy prevents the identification of deep intronic mutations.
In this study, we perform a mutational screening on a candidate gene that maps within a previously identified linkage interval by using UHTs to analyze its introns and exons. We exclude MDH1 as a candidate gene for RP28linked arRP and determine that patients from the only RP28 families identified so far carry different haplotypes in this region. Under the assumption that the disease is caused by an RP28 mutation in both families, our finding suggests that these Indian pedigrees are truly unrelated. The absence of a common disease haplotype implies the absence of a founder mutation and, furthermore, does not exclude the possibility that two distinct arRP loci could lie in this region. Therefore, the 13 other genes present in the RP28 interval and expressed in the retina remain to be screened, likely by a global and blind sequencing strategy. Unlike MDH1, no biologic feature currently points to another promising candidate. Since several genes have already been shown to cause RP despite their apparent irrelevance to retinal physiology as well as their ubiquitous expression, the hypothesis that a non-obvious candidate could be the RP28 gene is not particularly surprising.
However, in addition to these specific results, our work can be considered a proof of concept experiment for the use of highly parallel sequencing techniques for mutation detection in monogenic diseases. The assumption that UHTs could replace conventional sequencing in human gene screening procedures is not completely straightforward, since the former technique has a series of defects (e.g., short read length and relatively poor accuracy in base calling) that render it not particularly suitable for the analysis of the human genome [19]. Although we have willingly reduced the complexity of our analyses by selecting homozygous DNA regions in patients from consanguineous families, our results indicate that mutation detection via UHTs could be performed even when coverage is relatively low. Factors such as heterozygozity (e.g., in dominant diseases) or pooling of multiple samples unquestionably increase the noise and decrease the power of UHTs analyses. However, recent studies have shown that, when specific countermeasures are adopted, these confounding elements can be reduced or eliminated [20][21][22].
In summary, we believe that the strategy used for the analysis of the MDH1 gene likely represents one of the technical approaches to future candidate gene screening in monogenic conditions. Since retinitis pigmentosa and allied diseases show great genetic heterogeneity, the screening of large sets of patient DNA samples is required. Therefore, UHTs, in combination with systematic long-range PCR [23][24][25] or genomic sequence capturing [26,27], can be particularly relevant in the analysis of this disorder.

ACKNOWLEDGMENTS
We would like to acknowledge Dr. Gábor Csárdi for Python programming and Fasteris SA, Plan-les-Ouates, Switzerland, for help on UHT sequencing. Part of the analyses were performed at the Vital-IT Center for high-performance computing of the Swiss Institute of Bioinformatics. Our work was supported by the Swiss National Science Foundation